House Prices Solution

This competition challenges you to predict the final price of each home using 79 explanatory variables that describe (almost) every aspect of residential homes in Ames, Iowa.

In this notebook: **Quick EDA** -> **Data Cleaning** -> **Predictive Modeling** -> **Submission**

Let's Start!

In [1]:
#Import Required Libraries
import warnings
warnings.filterwarnings("ignore")

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
print(os.listdir("../input"))
['house-prices-advanced-regression-techniques']

Quick EDA

In [2]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv') 
test  = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
In [3]:
train.shape
Out[3]:
(1460, 81)
In [4]:
train.head()
Out[4]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [5]:
# Automated EDA report (pandas-profiling; renamed ydata-profiling in newer releases)
import pandas_profiling

profile_report = pandas_profiling.ProfileReport(train)
profile_report
Out[5]:

Data Cleaning

In [6]:
# Dropping rows where the target is missing
Target = 'SalePrice'
train.dropna(axis=0, subset=[Target], inplace=True)
In [7]:
# Combine the training and test sets so preprocessing stays consistent across both
data = pd.concat([train.iloc[:, :-1], test], axis=0)

print('train df has {} rows and {} columns'.format(train.shape[0], train.shape[1]))
print('test df has {} rows and {} columns'.format(test.shape[0], test.shape[1]))
print('Combined df has {} rows and {} columns'.format(data.shape[0], data.shape[1]))
train df has 1460 rows and 81 columns
test df has 1459 rows and 80 columns
Combined df has 2919 rows and 80 columns
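As an aside, `pd.concat` with `keys` tags each block of rows with its origin, so the combined frame can be split back later without hard-coding row counts. This is an illustrative sketch on toy data, not what the notebook runs:

```python
import pandas as pd

# Toy stand-ins for train (with target) and test (without); as in the notebook,
# the target column is dropped from train before concatenating.
train_toy = pd.DataFrame({'LotArea': [8450, 9600], 'SalePrice': [208500, 181500]})
test_toy = pd.DataFrame({'LotArea': [11250]})

# Tag each block with a key so the combined frame remembers its origin.
combined = pd.concat([train_toy.drop(columns=['SalePrice']), test_toy],
                     keys=['train', 'test'])

# Recover each piece later without magic numbers like 1460.
train_part = combined.loc['train']
test_part = combined.loc['test']
```

With the real data, `combined.loc['train']` and `combined.loc['test']` would replace the positional `iloc` split used further down.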
In [8]:
data.head()
Out[8]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 2 2008 WD Normal
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 5 2007 WD Normal
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 9 2008 WD Normal
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 2 2006 WD Abnorml
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 12 2008 WD Normal

5 rows × 80 columns

In [9]:
# Drop the Id column: it is a row identifier, not a predictive feature
data = data.drop(columns=['Id'])
In [10]:
# Looking for Missing Values

def missingValuesInfo(df):
    # Count and percentage of missing values per column, largest first
    total = df.isnull().sum().sort_values(ascending=False)
    percent = round(total / len(df) * 100, 2)
    temp = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return temp.loc[temp['Total'] > 0]

missingValuesInfo(train)
Out[10]:
Total Percent
PoolQC 1453 99.52
MiscFeature 1406 96.30
Alley 1369 93.77
Fence 1179 80.75
FireplaceQu 690 47.26
LotFrontage 259 17.74
GarageCond 81 5.55
GarageType 81 5.55
GarageYrBlt 81 5.55
GarageFinish 81 5.55
GarageQual 81 5.55
BsmtExposure 38 2.60
BsmtFinType2 38 2.60
BsmtFinType1 37 2.53
BsmtCond 37 2.53
BsmtQual 37 2.53
MasVnrArea 8 0.55
MasVnrType 8 0.55
Electrical 1 0.07
In [11]:
# Missing Value Handling

def HandleMissingValues(df):
    # Object columns: fill with the sentinel 'UNKOWN'
    # Numeric columns: fill with the column median
    num_cols = [cname for cname in df.columns if df[cname].dtype in ['int64', 'float64']]
    cat_cols = [cname for cname in df.columns if df[cname].dtype == "object"]

    values = {col: 'UNKOWN' for col in cat_cols}
    for col in num_cols:
        values[col] = df[col].median()

    df.fillna(value=values, inplace=True)
    
    
HandleMissingValues(data)
data.head()
Out[11]:
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 60 RL 65.0 8450 Pave UNKOWN Reg Lvl AllPub Inside ... 0 0 UNKOWN UNKOWN UNKOWN 0 2 2008 WD Normal
1 20 RL 80.0 9600 Pave UNKOWN Reg Lvl AllPub FR2 ... 0 0 UNKOWN UNKOWN UNKOWN 0 5 2007 WD Normal
2 60 RL 68.0 11250 Pave UNKOWN IR1 Lvl AllPub Inside ... 0 0 UNKOWN UNKOWN UNKOWN 0 9 2008 WD Normal
3 70 RL 60.0 9550 Pave UNKOWN IR1 Lvl AllPub Corner ... 0 0 UNKOWN UNKOWN UNKOWN 0 2 2006 WD Abnorml
4 60 RL 84.0 14260 Pave UNKOWN IR1 Lvl AllPub FR2 ... 0 0 UNKOWN UNKOWN UNKOWN 0 12 2008 WD Normal

5 rows × 79 columns

In [12]:
# Check for any missing values
data.isnull().sum().sum()
Out[12]:
0
In [13]:
#Categorical Feature Encoding

def getObjectColumnsList(df):
    return [cname for cname in df.columns if df[cname].dtype == "object"]

def PerformOneHotEncoding(df,columnsToEncode):
    return pd.get_dummies(df,columns = columnsToEncode)

cat_cols = getObjectColumnsList(data)
data = PerformOneHotEncoding(data,cat_cols)
data.head()
Out[13]:
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... SaleType_New SaleType_Oth SaleType_UNKOWN SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 60 65.0 8450 7 5 2003 2003 196.0 706.0 0.0 ... 0 0 0 1 0 0 0 0 1 0
1 20 80.0 9600 6 8 1976 1976 0.0 978.0 0.0 ... 0 0 0 1 0 0 0 0 1 0
2 60 68.0 11250 7 5 2001 2002 162.0 486.0 0.0 ... 0 0 0 1 0 0 0 0 1 0
3 70 60.0 9550 7 5 1915 1970 0.0 216.0 0.0 ... 0 0 0 1 1 0 0 0 0 0
4 60 84.0 14260 8 5 2000 2000 350.0 655.0 0.0 ... 0 0 0 1 0 0 0 0 1 0

5 rows × 311 columns

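Encoding train and test together (as above) guarantees both halves end up with identical dummy columns. If they were encoded separately, a category present in only one set would produce mismatched columns, which `DataFrame.align` can reconcile. A minimal sketch on toy data:

```python
import pandas as pd

# Toy train/test columns where the test set contains a level ('Oth')
# that never appears in the training set
train_toy = pd.DataFrame({'SaleType': ['WD', 'New']})
test_toy = pd.DataFrame({'SaleType': ['WD', 'Oth']})

# Encoding separately produces mismatched dummy columns...
tr = pd.get_dummies(train_toy)
te = pd.get_dummies(test_toy)

# ...which align() reconciles, filling the missing dummies with 0
tr, te = tr.align(te, join='outer', axis=1, fill_value=0)
```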
In [14]:
data.shape
Out[14]:
(2919, 311)
In [15]:
# Splitting the combined data back into train and test sets
train_data = data.iloc[:train.shape[0], :]
test_data = data.iloc[train.shape[0]:, :]
print(train_data.shape)
test_data.shape
(1460, 311)
Out[15]:
(1459, 311)
In [16]:
# Get X,y for modelling
X=train_data
y=train.loc[:,'SalePrice']

Predictive Modeling

In [17]:
from sklearn.linear_model import RidgeCV

ridge_cv = RidgeCV(alphas=(0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
ridge_cv.fit(X, y)
ridge_cv_preds=ridge_cv.predict(test_data)
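Since the notebook fits on all of the training data and predicts the test set directly, there is no local score to look at; `cross_val_score` can estimate RMSE before submitting. A sketch on synthetic regression data (the competition actually scores RMSE on log-prices, which this simplified example does not apply):

```python
import numpy as np
from sklearn.datasets import make_regression  # synthetic stand-in for the housing data
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0,
                                 random_state=0)

ridge = RidgeCV(alphas=(0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
# The scorer returns negated RMSE (higher is better), so flip the sign
rmse_scores = -cross_val_score(ridge, X_demo, y_demo, cv=5,
                               scoring='neg_root_mean_squared_error')
```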
In [18]:
import xgboost as xgb

model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.2)
model_xgb.fit(X, y)
xgb_preds=model_xgb.predict(test_data)
[07:01:28] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
In [19]:
# Simple average of the two models' predictions
predictions = (ridge_cv_preds + xgb_preds) / 2
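The simple average is the equal-weight special case of a weighted blend; the weight could instead be tuned on a hold-out set. A minimal sketch with made-up prediction values:

```python
import numpy as np

# Stand-in predictions from the two models (illustrative values only)
ridge_demo = np.array([100000.0, 200000.0])
xgb_demo = np.array([120000.0, 180000.0])

w = 0.5  # equal weighting reproduces the simple average above
blend = w * ridge_demo + (1 - w) * xgb_demo
```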

Submission

In [20]:
#make the submission data frame
submission = {
    'Id': test.Id.values,
    'SalePrice': predictions
}
solution = pd.DataFrame(submission)
solution.head()
Out[20]:
Id SalePrice
0 1461 111614.470560
1 1462 160072.839645
2 1463 179114.453715
3 1464 187362.455391
4 1465 202247.130090
In [21]:
#make the submission file
solution.to_csv('submission.csv',index=False)
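Before uploading, it can be worth reading the file back to confirm it has exactly the two columns the competition expects. A sketch using a small stand-in frame and a hypothetical `demo_submission.csv` path:

```python
import pandas as pd

# Illustrative frame mirroring the submission's layout
demo = pd.DataFrame({'Id': [1461, 1462],
                     'SalePrice': [111614.47, 160072.84]})
demo.to_csv('demo_submission.csv', index=False)

# Read it back and confirm the expected two-column layout survived the round trip
check = pd.read_csv('demo_submission.csv')
```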

Credits